Microbiome Sample metadata

What is metadata?

  • Metadata is a set of data that describes and provides information about other data. It is commonly defined as data about data.
  • Here, sample metadata refers to the description and context of each individual sample collected for a specific microbiome study.




Metadata structure

  • Metadata collected at different stages is typically organized in an Excel or Google spreadsheet where:
    • The table columns represent the properties of the samples.
    • The table rows contain the information associated with each sample.
    • Typically, the first column is the Sample ID, which serves as the key for each individual sample.
    • Each Sample ID must be unique.
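The uniqueness requirement can be checked before any analysis starts. The following is a minimal sketch using only the Python standard library; the column name `sample_id`, the helper `duplicated_sample_ids`, and the toy table are assumptions made for illustration and are not part of the original workflow.

```python
import csv
import io
from collections import Counter


def duplicated_sample_ids(rows, id_column="sample_id"):
    """Return the sample IDs that occur more than once, in first-seen order."""
    counts = Counter(row[id_column] for row in rows)
    return [sample_id for sample_id, n in counts.items() if n > 1]


# Toy metadata table (invented for illustration): one row per sample.
toy_table = io.StringIO(
    "sample_id,host,collection_date\n"
    "S1,mouse,2020-01-05\n"
    "S2,mouse,2020-01-06\n"
    "S2,mouse,2020-01-07\n"
)
rows = list(csv.DictReader(toy_table))
duplicates = duplicated_sample_ids(rows)
print(duplicates)  # ['S2']
```

An empty list means every Sample ID is unique; any ID that is returned should be resolved in the spreadsheet before the table is used as a mapping file.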

Embedded metadata

  • In most cases, the metadata is stored separately from the experimental data.
  • Embedded metadata is merged with the experimental data, which is especially useful for graphics.
  • Major microbiome analysis platforms require sample metadata, commonly referred to as a mapping file, for downstream analysis.
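To illustrate what embedding means in practice, here is a minimal sketch that attaches each sample's metadata to its experimental record using the Sample ID as the key. The function name `embed_metadata`, the column names, and the toy values are hypothetical and only stand in for a real abundance table and mapping file.

```python
def embed_metadata(data_rows, metadata_rows, id_column="sample_id"):
    """Attach the metadata fields of each sample to its experimental record."""
    metadata_by_id = {row[id_column]: row for row in metadata_rows}
    merged = []
    for row in data_rows:
        # Samples without metadata keep only their experimental fields.
        meta = metadata_by_id.get(row[id_column], {})
        merged.append({**meta, **row})
    return merged


# Toy experimental data and mapping file (invented for illustration).
data_rows = [
    {"sample_id": "S1", "read_count": 1200},
    {"sample_id": "S2", "read_count": 950},
]
metadata_rows = [
    {"sample_id": "S1", "diet": "control"},
    {"sample_id": "S2", "diet": "high-fat"},
]
embedded = embed_metadata(data_rows, metadata_rows)
print(embedded[1])
```

Each merged record now carries both the measurement and the sample properties, which is the form most plotting functions expect when grouping or coloring by a metadata variable.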

Next, we will practice downloading metadata from the SRA database.


How to Download NCBI-SRA metadata

Different methods exist for downloading sample metadata deposited in the Sequence Read Archive (SRA) or the European Nucleotide Archive (ENA). Each method yields slightly different information, so it is good practice to explore which one best suits your needs. For this demo, we will explore sample metadata retrieved from five randomly selected microbiome BioProjects:

  1. PRJNA477349: 16S: rRNA from bushmeat samples collected from Tanzania Metagenome
  2. PRJNA802976: 16S: Changes to Gut Microbiota following Systemic Antibiotic Administration in Infants
  3. PRJNA322554: 16S: The Early Infant Gut Microbiome Varies In Association with a Maternal High-fat Diet
  4. PRJNA937707: 16S: Exploring methods for manipulating both the composition and genomic content of bacteria in the mouse gut
  5. PRJNA589182: 16S: 16S rDNA gene sequencing of the phyllosphere endophytic bacterial communities colonizing wild Populus trichocarpa Raw sequence reads

Manually via SRA Run Selector

We can manually retrieve metadata from the SRA database via the SRA Run Selector.

  • Note that the metadata file downloaded from the SRA Run Selector is automatically named SraRunTable.txt.
  • Users can change the default TXT extension to another one, such as CSV, if preferred.
  • In our demo, we will save the metadata file with a CSV extension in the data/metadata/ folder.
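The renaming step above can also be scripted. This is a minimal sketch, assuming the downloaded file is already comma-separated (as the renaming advice implies); the helper `save_run_table`, the output file name pattern, and the toy accession `SRR0000001` are hypothetical. A temporary directory stands in for the project folder so the example runs anywhere.

```python
import shutil
import tempfile
from pathlib import Path


def save_run_table(downloaded, metadata_dir, bioproject):
    """Copy a downloaded SraRunTable.txt to <metadata_dir>/<bioproject>_SraRunTable.csv."""
    metadata_dir = Path(metadata_dir)
    metadata_dir.mkdir(parents=True, exist_ok=True)
    target = metadata_dir / f"{bioproject}_SraRunTable.csv"
    shutil.copyfile(downloaded, target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the file saved by the browser (toy content).
    downloaded = Path(tmp) / "SraRunTable.txt"
    downloaded.write_text("Run,BioProject\nSRR0000001,PRJNA477349\n")

    saved = save_run_table(downloaded, Path(tmp) / "data" / "metadata", "PRJNA477349")
    saved_name = saved.name
    saved_text = saved.read_text()

print(saved_name)  # PRJNA477349_SraRunTable.csv
```

Including the BioProject number in the file name keeps the tables from different projects from overwriting one another, since every download starts with the same default name.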

Example screenshot of the SRA Run Selector showing metadata associated with the NCBI-SRA BioProject number PRJNA477349

Computationally via Entrez Direct scripts

#!/bin/bash
# Download run-level metadata (runinfo) for each BioProject with Entrez Direct.

mkdir -p data/metadata  # make sure the output folder exists

for project in PRJNA477349 PRJNA802976 PRJNA322554 PRJNA937707 PRJNA589182; do
    esearch -db sra -query "${project}[bioproject]" \
        | efetch -format runinfo >"data/metadata/runinfo_${project}_metadata.csv"
done
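Once a runinfo file is saved, it can be inspected with a few lines of Python. The sketch below lists the run accessions and orders the runs by file size. It assumes the runinfo table contains the columns `Run` and `size_MB`; the helper `runs_by_size` and the toy rows are invented for illustration, so check the header of your own file before relying on these names.

```python
import csv
import io


def runs_by_size(runinfo_rows, descending=False):
    """Sort runinfo records by their size_MB value."""
    return sorted(runinfo_rows, key=lambda row: float(row["size_MB"]), reverse=descending)


# Toy runinfo content (invented values); a real file would be opened with open(...).
toy_runinfo = io.StringIO(
    "Run,size_MB,BioProject\n"
    "SRR0000001,12,PRJNA477349\n"
    "SRR0000002,3,PRJNA477349\n"
    "SRR0000003,7,PRJNA477349\n"
)
rows = list(csv.DictReader(toy_runinfo))

accessions = [row["Run"] for row in rows]
smallest_first = [row["Run"] for row in runs_by_size(rows)]
largest_first = [row["Run"] for row in runs_by_size(rows, descending=True)]
print(smallest_first)  # ['SRR0000002', 'SRR0000003', 'SRR0000001']
```

Sorting by size is a convenient way to pick a few small runs for testing a pipeline before downloading the full dataset.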

Computationally using pysradb

The pysradb tool can obtain metadata from SRA and ENA. Here we create a dedicated conda environment and install pysradb in it. The environment can be deleted when it is no longer needed.

  • First, we create a pysradb environment, install the pysradb tool, and activate the environment.
conda activate base
conda create -c bioconda -n pysradb python=3 pysradb
conda activate pysradb
  • Then we use pysradb from the command line to download SRA metadata like so:
#!/bin/bash
# Shell script: workflow/scripts/pysradb_sra_metadata.sh

pysradb metadata PRJNA477349 --detailed >data/metadata/PRJNA477349_pysradb.csv
pysradb metadata PRJNA802976 --detailed >data/metadata/PRJNA802976_pysradb.csv
pysradb metadata PRJNA322554 --detailed >data/metadata/PRJNA322554_pysradb.csv
pysradb metadata PRJNA937707 --detailed >data/metadata/PRJNA937707_pysradb.csv
pysradb metadata PRJNA589182 --detailed >data/metadata/PRJNA589182_pysradb.csv
  • We can also download SRA metadata using Python like so:
# Python script: workflow/scripts/pysradb_sra_metadata.py
# Download detailed SRA metadata for each BioProject and save it as CSV.

from pysradb.sraweb import SRAweb

BIOPROJECTS = [
    'PRJNA477349',
    'PRJNA802976',
    'PRJNA322554',
    'PRJNA937707',
    'PRJNA589182',
]

db = SRAweb()  # one client is reused for every query

for project in BIOPROJECTS:
    df = db.sra_metadata(project, detailed=True)
    df.to_csv(f'data/metadata/{project}_pysradb_metadata.csv', index=False)

Querying SRA or ENA with a keyword

Searching a large database by keyword narrows the results to studies of interest, such as those related to a particular disease. The example below retrieves up to 100 records matching the keyword Amplicon from SRA and from ENA.

#!/bin/bash

pysradb search --db sra -q Amplicon --max 100 >sra_amplicon_studies.csv

pysradb search --db ena -q Amplicon --max 100 >ena_amplicon_studies.csv
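Saved search results can be narrowed further with a second keyword. The following is a minimal sketch that keeps the records in which any field mentions the keyword, ignoring case. The helper `rows_matching`, the column names, and the toy records are hypothetical; the actual columns depend on the database queried.

```python
def rows_matching(rows, keyword):
    """Keep the records in which any field contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        row for row in rows
        if any(needle in str(value).lower() for value in row.values())
    ]


# Toy search results (invented for illustration).
search_results = [
    {"accession": "A1", "title": "Gut microbiome amplicon survey in infants"},
    {"accession": "A2", "title": "Soil amplicon sequencing"},
    {"accession": "A3", "title": "Infant gut 16S profiles after antibiotics"},
]
gut_studies = rows_matching(search_results, "gut")
print([row["accession"] for row in gut_studies])  # ['A1', 'A3']
```

Matching against every field is deliberately broad. Restricting the check to a title or description column would give fewer false positives once the column names of the real output are known.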

Next, we will explore metadata to learn more about the represented features.


Explore microbiome sample metadata

Exploring variable relationships


Locating sampling points on map

Dropping pins on a map is possible if you have coordinate data, that is, the latitude and longitude of the collection points.
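Mapping tools generally expect signed decimal degrees, whereas coordinates in sample metadata are often written as a single string such as "6.37 S 34.89 E". The sketch below converts that notation into a (latitude, longitude) pair. The helper `parse_lat_lon` and the example coordinates are illustrative assumptions, and other notations (for example degrees, minutes, and seconds) are not handled.

```python
import re

# Matches strings such as "6.37 S 34.89 E" (decimal degrees plus hemisphere letters).
LAT_LON = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*([NS])\s+(\d+(?:\.\d+)?)\s*([EW])\s*$",
    re.IGNORECASE,
)


def parse_lat_lon(text):
    """Return (latitude, longitude) in signed decimal degrees, or None if unparseable."""
    match = LAT_LON.match(text)
    if match is None:
        return None
    lat, north_south, lon, east_west = match.groups()
    latitude = float(lat) * (-1 if north_south.upper() == "S" else 1)
    longitude = float(lon) * (-1 if east_west.upper() == "W" else 1)
    return latitude, longitude


print(parse_lat_lon("6.37 S 34.89 E"))  # (-6.37, 34.89)
print(parse_lat_lon("not collected"))   # None
```

Southern latitudes and western longitudes become negative, and entries without usable coordinates return None so they can be skipped when the pins are drawn.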









Appendix

Project main tree

.
├── LICENSE
├── README.md
├── Rplots.pdf
├── config
│   ├── config.yaml
│   ├── samples.tsv
│   └── units.tsv
├── dags
│   ├── rulegraph.png
│   └── rulegraph.svg
├── data
│   └── metadata
├── ena_amplicon_studies.csv
├── images
│   ├── PRJNA477349_variable_freq.png
│   ├── PRJNA477349_variable_freq.svg
│   ├── bkgd.png
│   ├── geeks.png
│   ├── gpsfiles
│   ├── metadata.png
│   ├── sample_gps.png
│   ├── smkreport
│   └── sra_run_selector.png
├── imap-sample-metadata.Rproj
├── index.Rmd
├── library
│   ├── apa.csl
│   ├── imap.bib
│   └── references.bib
├── report.html
├── resources
├── results
│   ├── PRJNA322554_read_size_asc.csv
│   ├── PRJNA322554_read_size_desc.csv
│   ├── PRJNA322554_srarun_accessions.txt
│   ├── PRJNA477349_read_size_asc.csv
│   ├── PRJNA477349_read_size_desc.csv
│   ├── PRJNA477349_srarun_accessions.txt
│   ├── PRJNA589182_read_size_asc.csv
│   ├── PRJNA589182_read_size_desc.csv
│   ├── PRJNA589182_srarun_accessions.txt
│   ├── PRJNA802976_read_size_asc.csv
│   ├── PRJNA802976_read_size_desc.csv
│   ├── PRJNA802976_srarun_accessions.txt
│   ├── PRJNA937707_read_size_asc.csv
│   ├── PRJNA937707_read_size_desc.csv
│   ├── PRJNA937707_srarun_accessions.txt
│   ├── project_tree.txt
│   └── sample_location.csv
├── sra_amplicon_studies.csv
├── styles.css
└── workflow
    ├── Snakefile
    ├── envs
    ├── reports
    ├── rules
    ├── schemas
    └── scripts

16 directories, 42 files

Screenshot of interactive snakemake report

The interactive Snakemake HTML report can be viewed by opening report.html in any compatible browser. You will be able to explore the workflow and the associated statistics. You can close the left sidebar to get a wider display.
